Team 8
Jianhao Hong
Boston University
Xinran Li
Boston University
Chialing Sung
Boston University
Zimo Zeng
Boston University
This section presents enhanced exploratory data analysis (EDA) to uncover meaningful insights from job posting data. By combining statistical summaries with polished visualizations, we aim to clarify trends in labor market demand, salary distribution, and remote work preferences. These insights are critical for understanding industry dynamics, guiding policy, and informing personal career decisions.
import pandas as pd
import plotly.express as px
import plotly.io as pio
# Set global Plotly theme
pio.templates.default = "plotly_white"
# Load and clean data
df = pd.read_csv("/home/ubuntu/ad688-employability-sp25A1-group8-1/job_postings.csv")
df.columns = df.columns.str.upper()
# Prepare data
industry_df = df["NAICS_2022_6_NAME"].value_counts().nlargest(30).sort_values(ascending=True)
industry_df = industry_df.to_frame(name="Postings").reset_index()
industry_df.columns = ["Industry", "Postings"]
# Create bar chart with pastel tones
fig = px.bar(
industry_df,
x="Postings",
y="Industry",
orientation="h",
title="Top 30 Industries by Job Postings",
color="Postings",
color_continuous_scale="Mint",
height=900,
width=1000,
text="Postings"
)
# Layout enhancements
fig.update_layout(
title_font=dict(size=22, family="Arial", color="#1f77b4"),
font=dict(family="Arial", size=13, color="#333"),
xaxis_title="Number of Postings",
yaxis_title="",
xaxis_tickformat=",",
margin=dict(l=250, r=50, t=80, b=50),
coloraxis_showscale=False
)
# Show text labels on bars
fig.update_traces(
textposition="outside",
cliponaxis=False
)
# Save files (static PNG export needs an image-export engine such as kaleido)
try:
    fig.write_image("_output/eda_1_top_industries.png")
    fig.write_image("_output/job_postings_by_industry.png")
except Exception as e:
    print("PNG export failed:", e)
fig.show()
The bar chart compares job posting volumes across industries, helping stakeholders quickly identify high-demand sectors. Bars are ranked from most to fewest postings, with clean labeling and a consistent color theme for clarity and interpretability.
🔹 Clarifies Labor Market Demand
- Identifies High-Demand Sectors: The top industries include Custom Computer Programming Services, Administrative Management, and Employment Placement Agencies, indicating strong demand in tech and business consulting.
- Highlights Data Classification Gaps: The high count of "Unclassified Industry" postings may reflect inconsistent data labeling or gaps in employer input.
- Reveals Structural Shifts: Industries such as retail, telecom, and administration appear lower on the list, which may reflect automation and shifting consumer behavior, although posting counts alone cannot confirm the cause.
🔹 Enables Career Planning
- Supports Strategic Reskilling: Students and job seekers can realign learning goals with industries that show the most activity.
- Guides Workforce Programs: Educational institutions and policy makers can tailor upskilling programs for booming sectors like tech, finance, and healthcare.
🔹 Improves Visualization Communication
- Clear Comparison: The horizontal bar chart presents a side-by-side ranking that's easy to digest.
- Stakeholder-Friendly: A useful resource for employers, schools, and analysts to drive strategy and decision-making based on the industry demand reflected in the posting data.
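The classification gap noted above can be put in numbers by expressing each industry's postings as a share of the total. The following is a minimal sketch on hypothetical toy rows, not the team's pipeline. The helper name `industry_share` and the sample data are assumptions made for illustration. Only the column name `NAICS_2022_6_NAME` and the "Unclassified Industry" label come from this section.

```python
import pandas as pd

# Hypothetical toy rows standing in for job_postings.csv
toy = pd.DataFrame({
    "NAICS_2022_6_NAME": [
        "Unclassified Industry",
        "Unclassified Industry",
        "Custom Computer Programming Services",
        "Custom Computer Programming Services",
        "Employment Placement Agencies",
    ]
})

def industry_share(frame, column="NAICS_2022_6_NAME"):
    """Return each industry's share of postings in percent (hypothetical helper)."""
    # normalize=True turns raw counts into fractions of all non-null rows
    return (frame[column].value_counts(normalize=True) * 100).round(1)

shares = industry_share(toy)
print(shares)
```

Applied to the real `df`, the same call would show what fraction of postings carry the "Unclassified Industry" label, which is a useful caveat to report alongside the ranking.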
import pandas as pd
import plotly.express as px
import plotly.io as pio
# Set custom Plotly theme
pio.templates["custom"] = pio.templates["plotly_white"].update({
"layout": {
"font": {"family": "Arial", "size": 14, "color": "#333"},
"title": {"x": 0.05, "font": {"size": 22, "color": "#1f77b4"}},
"paper_bgcolor": "white",
"plot_bgcolor": "white",
}
})
pio.templates.default = "custom"
# Ensure salary column exists
df["SALARY"] = pd.to_numeric(df.get("POSTED_SALARY", df.get("SALARY", None)), errors="coerce")
# Filter top 20 industries with valid salary data
top_20_industries = df["NAICS_2022_6_NAME"].value_counts().head(20).index
df_salary = df[df["NAICS_2022_6_NAME"].isin(top_20_industries) & df["SALARY"].notnull()]
# Calculate median salary and sort
median_salary = (
df_salary.groupby("NAICS_2022_6_NAME")["SALARY"]
.median()
.sort_values()
.reset_index()
.rename(columns={"SALARY": "Median"})
)
# Merge and set category order
df_salary = df_salary.merge(median_salary, on="NAICS_2022_6_NAME")
df_salary["NAICS_2022_6_NAME"] = pd.Categorical(
df_salary["NAICS_2022_6_NAME"],
categories=median_salary["NAICS_2022_6_NAME"],
ordered=True
)
# Use the RdYlBu diverging palette (it has 11 colors, so Plotly cycles them across the 20 industries)
rdyblu_palette = px.colors.diverging.RdYlBu[:20]
# Create Plotly boxplot
fig = px.box(
df_salary,
x="SALARY",
y="NAICS_2022_6_NAME",
color="NAICS_2022_6_NAME",
color_discrete_sequence=rdyblu_palette,
points=False,
title="Salary Distribution by Industry (Top 20)",
height=900,
width=1100
)
# Layout customization
fig.update_layout(
showlegend=False,
xaxis_title="Salary (USD)",
yaxis_title="Industry",
margin=dict(l=250, r=50, t=80, b=50)
)
# Format x-axis
fig.update_xaxes(tickformat="$,~s")
# Save
fig.write_image("_output/eda_2_salary_distribution_by_industry_plotly.png")
fig.show()
This horizontal boxplot summarizes salary ranges across the top 20 industries. Sorting industries by median salary and applying a diverging color palette make it easier to compare central tendency and variation in compensation.
🔹 Highlights Compensation Gaps
- Identifies High-Paying Fields: Sectors like Web Search Portals, Administrative Management, and Certified Public Accountants demonstrate significantly higher median salaries.
- Visualizes Salary Spread: The boxplot format allows users to assess industry-level variation, skewness, and the presence of salary outliers.
🔹 Supports Career and Salary Planning
- Informs Job Seekers: Individuals can set more realistic salary expectations and prioritize sectors with better earning potential.
- Encourages Upskilling: Fields with broad salary distributions often reward advanced skills, certifications, or specialization.
🔹 Improves Employer Benchmarking
- Supports Compensation Strategy: Businesses can evaluate whether their offered salaries are competitive within their industry.
- Guides Policy Analysis: Policymakers and analysts can use this to assess pay equity across different domains and job categories.
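The spread and outliers that the boxplot shows can also be tabulated. The sketch below is illustrative only: the helper `salary_spread`, the industry names, and the salary values are hypothetical. It computes quartiles, the interquartile range, and a count of points beyond the conventional 1.5 × IQR whiskers per industry. Plotly's box trace may use a slightly different quartile method, so the numbers are not guaranteed to match the chart exactly.

```python
import pandas as pd

# Hypothetical salaries for two made-up industries
toy = pd.DataFrame({
    "NAICS_2022_6_NAME": ["Industry A"] * 5 + ["Industry B"] * 5,
    "SALARY": [50_000, 60_000, 70_000, 80_000, 200_000,
               90_000, 95_000, 100_000, 105_000, 110_000],
})

def salary_spread(frame, group_col="NAICS_2022_6_NAME", value_col="SALARY"):
    """Summarize quartiles, IQR, and 1.5*IQR outlier counts per group."""
    grouped = frame.groupby(group_col)[value_col]
    summary = grouped.quantile([0.25, 0.5, 0.75]).unstack()
    summary.columns = ["q1", "median", "q3"]
    summary["iqr"] = summary["q3"] - summary["q1"]
    # Map every row to its own group's whisker limits
    lower = frame[group_col].map(summary["q1"] - 1.5 * summary["iqr"])
    upper = frame[group_col].map(summary["q3"] + 1.5 * summary["iqr"])
    outside = (frame[value_col] < lower) | (frame[value_col] > upper)
    summary["outliers"] = outside.groupby(frame[group_col]).sum()
    # Sort by median, mirroring the ordering used for the boxplot
    return summary.sort_values("median")

spread = salary_spread(toy)
print(spread)
```

A table like this gives readers exact figures to cite where the chart only supports visual comparison.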
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
# Clean and prepare data
df_filtered = df[df["REMOTE_TYPE_NAME"].notna()]
df_filtered = df_filtered[df_filtered["REMOTE_TYPE_NAME"] != "[None]"]
remote_counts = df_filtered["REMOTE_TYPE_NAME"].value_counts().reset_index()
remote_counts.columns = ["Remote Type", "Count"]
# Calculate percentage
remote_counts["Percent"] = remote_counts["Count"] / remote_counts["Count"].sum() * 100
remote_counts["Label"] = remote_counts.apply(
lambda row: f"{row['Remote Type']}<br>{row['Percent']:.1f}% ({row['Count']})", axis=1
)
# Plotly Donut Pie Chart
fig = go.Figure(
data=[go.Pie(
labels=remote_counts["Remote Type"],
values=remote_counts["Count"],
hole=0.55,
marker=dict(colors=["#F77F00", "#F9A03F", "#FFE29A"], line=dict(color="white", width=2)),
text=remote_counts["Label"],
hoverinfo="text",
textinfo="label+percent",
textposition="outside"
)]
)
# Layout
fig.update_layout(
title="Remote Work Types (Excluding Unspecified)",
title_font_size=20,
font=dict(family="Arial", size=14, color="#333"),
margin=dict(l=50, r=50, t=80, b=50),
showlegend=False
)
# Save
fig.write_image("_output/eda_3_remote_work_types_pie_plotly.png")
# Show
fig.show()
A donut-style pie chart provides a compact view of how remote work types are distributed. It shows the proportion of fully remote, hybrid, and on-site jobs through distinct colors and labeled percentages, which keeps it accessible to both technical and non-technical audiences.
🔹 Reflects Modern Work Preferences
- Tracks Remote Work Trends: Shows the shift toward flexible work environments in the job market.
- Supports Talent Strategy: Employers can adjust remote policies based on what's prevalent in the broader market.
🔹 Helps Job Matching
- Informs Job Seekers: Helps individuals choose jobs based on lifestyle and location flexibility.
- Supports Equity & Accessibility: Remote jobs enable access for rural or underserved populations, reducing geographic barriers.
🔹 Improves Visual Simplicity
- Compact Presentation: Donut chart gives a clear breakdown in one glance.
- Enhances Dashboard Readability: Ideal for HR and workforce dashboards to display high-level trends for decision-making.
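Because the chart excludes unspecified postings, its percentages describe only the rows that carry a remote-type label. The sketch below, on hypothetical data, computes those shares together with the fraction of postings they cover. The helper `remote_shares` and the category labels other than "[None]" are assumptions for illustration, not values confirmed by the dataset.

```python
import pandas as pd

# Hypothetical labels and counts; only "[None]" is taken from the code above
toy = pd.Series(
    ["Remote"] * 3 + ["Hybrid Remote"] + ["Not Remote"] + ["[None]"] * 4 + [None],
    name="REMOTE_TYPE_NAME",
)

def remote_shares(values, unspecified=("[None]",)):
    """Return (share per remote type in %, % of rows that are specified)."""
    specified = values.dropna()
    specified = specified[~specified.isin(list(unspecified))]
    shares = (specified.value_counts(normalize=True) * 100).round(1)
    coverage = round(len(specified) / len(values) * 100, 1)
    return shares, coverage

shares, coverage = remote_shares(toy)
print(shares)
print(f"Specified postings: {coverage}% of all rows")
```

Reporting the coverage figure next to the donut chart makes clear how much of the dataset the percentages actually represent.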